Introduction



The potential of computer adaptive testing (CAT) has been well documented. To improve
the efficiency of this process, it may be possible to use a neural network or, more specifically, a
back propagation neural network. To accomplish this, it must be shown that the grouping of
examinees by ability, as determined by a two-parameter logistic model (or any other item response
theory (IRT) model), can be replicated using a neural net.

Two assumptions underlie the application of the logistic model: (1) the performance of an examinee 
on an item can be explained by a set of latent traits and (2) the relationship between performance of 
the examinees and the set of latent traits can be described by a monotonically increasing function 
(Hamilton, Swaminathan, and Rogers, 1991). The two-parameter logistic function is of the form

$$P(\theta) = \frac{1}{1 + e^{-Da(\theta - b)}}$$

where $\theta$ is the ability of the examinee, $a$ is the item discrimination parameter, $b$ is the item
difficulty parameter, and $D$ is a scaling factor. It should be noted that in this form the numerator
and denominator of the function have been divided by $e^{Da(\theta - b)}$ (see Hamilton et al., 1991). The
properties that make this model useful in testing are that the item parameters are invariant with
respect to the examinee group used to calibrate them and that the estimate of an examinee's ability,
$\theta$, is not dependent upon the items selected to measure this ability. It is also assumed that there
is one dominant dimension of the trait being measured.
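The two-parameter logistic function described above can be illustrated with a minimal sketch. It assumes the form P(theta) = 1 / (1 + exp(-D a (theta - b))), which is what the text implies once the numerator and denominator are divided by exp(D a (theta - b)). The default D = 1.7 is a commonly used scaling constant and is an assumption here; the paper does not state the value it used. The item parameters in the example are hypothetical.

```python
import math

def two_pl(theta, a, b, D=1.7):
    """Probability of a correct response under a two-parameter logistic model.

    theta: examinee ability, a: item discrimination, b: item difficulty.
    D = 1.7 is an assumed scaling factor, not a value taken from the paper.
    """
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical item with discrimination 1.2 and difficulty 0.5.
p_at_difficulty = two_pl(theta=0.5, a=1.2, b=0.5)   # ability equals difficulty
p_above = two_pl(theta=1.5, a=1.2, b=0.5)           # ability above difficulty
p_below = two_pl(theta=-0.5, a=1.2, b=0.5)          # ability below difficulty
```

When ability equals the item difficulty the exponent is zero, so the probability is one half; probabilities rise monotonically with ability, as the second assumption of the model requires.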

A neural net consists of multiple layers of nodes interconnected by paths leading from each of the 
nodes in one layer to each of the nodes in the next higher level of the net. The path from a node at a 
particular layer to a node at the next higher layer is weighted. This weight is an indication of the 
strength of association between the nodes. The lowest level of the network is called the input layer, 
the intermediate levels are known as the hidden layers, and the highest level is called the output layer. 
A neural net is trained with complete data in anticipation that the weights, and therefore the 
parameters determining the weights, stabilize (Gustafson, 1989). 

The process of establishing these weights is called 'training' the neural net. This training takes place 
using a data set for which the outcomes are known. In this paper the net is trained using data for
examinees who have been grouped by a known performance level. The responses to each item on
an examination become the input layer, with each item representing a node in this input layer. Each
node in the output layer represents a different performance level group, where each examinee is
placed in exactly one of these groups. The number of hidden layers may be varied, as may the number
of nodes in these hidden layers. A diagram of a neural net is provided in Figure 1.



Insert Figure 1 about here 



A mathematical function is defined at each node, where the domain of the function is the vector
of weights from the next lower layer or, for the input layer, the input values. The range of this
function is usually the real numbers between zero and one. One commonly used function is the
sigmoid function, which can be written

$$F(z) = \frac{1}{1 + e^{-cz}}$$

This function approaches zero as z becomes negatively infinite. As z becomes positively
infinite, the function approaches one. The derivative of this function is
$F'(z) = cF(z)[1 - F(z)]$. Therefore, the rate of change of the function is parabolic with respect
to F(z).
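The derivative identity stated above can be checked numerically. The sketch below assumes the sigmoid has the form F(z) = 1 / (1 + exp(-c z)), which is the form consistent with the stated derivative c F(z) [1 - F(z)]; the function names and the step size h are illustrative choices.

```python
import math

def sigmoid(z, c=1.0):
    """Assumed sigmoid form: F(z) = 1 / (1 + exp(-c z))."""
    return 1.0 / (1.0 + math.exp(-c * z))

def sigmoid_derivative(z, c=1.0):
    """Closed-form derivative c * F(z) * (1 - F(z))."""
    f = sigmoid(z, c)
    return c * f * (1.0 - f)

def numeric_derivative(z, c=1.0, h=1e-6):
    """Central-difference approximation, used only as an independent check."""
    return (sigmoid(z + h, c) - sigmoid(z - h, c)) / (2.0 * h)

# Largest gap between the closed form and the numerical estimate at a few points.
max_gap = max(abs(sigmoid_derivative(z, 2.0) - numeric_derivative(z, 2.0))
              for z in (-2.0, 0.0, 1.5))
```

Because the derivative is c F (1 - F), it is a downward-opening parabola in F with its maximum at F = 0.5, which is the sense in which the rate of change is parabolic.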

In a three-layer neural net with an input layer, i; one hidden layer, j; and an output layer, k, the
weights $w_{ij}$ between the nodes on the i layer and the nodes on the j layer must be calculated, as
must the weights $w_{jk}$ between the nodes on the j layer and the nodes on the k layer. These
weights are calculated so as to minimize the global error

$$E = \frac{1}{2}\sum_k (d_k - y_k)^2$$

where $d_k$ is the desired outcome and $y_k$ is the actual outcome produced by the neural network
at the k, or output, layer. The weights $w_{ij}$ and $w_{jk}$ are adjusted through the sigmoid
function, F(z), described earlier, where $z_j = \sum_i w_{ij} y_i$ and $z_k = \sum_j w_{jk} y_j$. This
adjustment, $\Delta w_{jk}$ (the change in the weights between the j layer and the k layer), is derived
using the gradient descent technique, as shown in Figure 2.



Insert Figure 2 about here 



After the change in weights, $\Delta w_{jk}$, between the j layer and the k layer is derived, the result
may be used to derive $\Delta w_{ij}$, the change in weights between the i layer and the j layer. This
derivation is presented in Figure 3.



Insert Figure 3 about here 
It is important to note that the change in the weights between the layers depends upon the
output from the higher layers. This is the reason this model is called the back propagation 
neural net: the changes propagate back from the higher levels. 
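The backward flow of weight changes can be sketched for a single training pattern. This is a textbook gradient-descent update under stated assumptions, not a reproduction of the derivations in Figures 2 and 3: the error is taken to be half the sum of squared differences between desired and actual outputs, the sigmoid uses c = 1, there are no bias terms, and the layer sizes, learning rate, and input pattern are all hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_ij, w_jk):
    """Return hidden-layer and output-layer activations for one input pattern."""
    y_j = [sigmoid(sum(w_ij[i][j] * x[i] for i in range(len(x))))
           for j in range(len(w_ij[0]))]
    y_k = [sigmoid(sum(w_jk[j][k] * y_j[j] for j in range(len(y_j))))
           for k in range(len(w_jk[0]))]
    return y_j, y_k

def train_step(x, d, w_ij, w_jk, rate=0.5):
    """One gradient-descent update; returns the error measured before the update."""
    y_j, y_k = forward(x, w_ij, w_jk)
    # Output-layer terms: (d_k - y_k) * F'(z_k), with F' = F * (1 - F).
    delta_k = [(d[k] - y_k[k]) * y_k[k] * (1.0 - y_k[k]) for k in range(len(d))]
    # Hidden-layer terms are built from the output-layer terms: this is the
    # step in which the changes propagate back from the higher layer.
    delta_j = [y_j[j] * (1.0 - y_j[j]) *
               sum(delta_k[k] * w_jk[j][k] for k in range(len(d)))
               for j in range(len(y_j))]
    for j in range(len(y_j)):
        for k in range(len(d)):
            w_jk[j][k] += rate * delta_k[k] * y_j[j]
    for i in range(len(x)):
        for j in range(len(y_j)):
            w_ij[i][j] += rate * delta_j[j] * x[i]
    return 0.5 * sum((d[k] - y_k[k]) ** 2 for k in range(len(d)))

random.seed(0)
n_in, n_hidden, n_out = 4, 3, 2          # hypothetical layer sizes
w_ij = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
w_jk = [[random.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_hidden)]
x, d = [1, 0, 1, 1], [0, 1]              # one scored response pattern and its target
errors = [train_step(x, d, w_ij, w_jk) for _ in range(200)]
```

The list `errors` records the error before each update, so comparing its first and last entries shows whether the weights are moving toward the desired output for this pattern.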

Although this is apparently the first application of neural net technology to CAT data, neural 
net models have been used in place of the more traditional statistical procedures for 
classification problems. The results of studies by Scalia, Marconi, Ridelia, Arrigo, Mansi,
and Mela (1989); Rosenblatt, Lelu, and Georgel (1989); and Nelson and Nefif (1991) indicate
the feasibility of this approach. In each case, an attempt is made to stabilize the net.

The data used for this study were collected as part of the 1993 Summer Orientation program 
conducted for incoming freshmen at a fairly large midwestern university (undergraduate 
enrollment of about 17,000 students). A mathematics test consisting of 40 five-response
multiple-choice items was administered to 1,615 freshmen. The test results are used to
place students into appropriate mathematics courses suitable for satisfying the mathematics 
requirement for graduation. 

Procedures 

The process the researchers followed involved several steps. The first step was to attempt to 
determine the existence of a dominant factor (rather than strict unidimensionality). This was 
accomplished through factor analyses consisting of a principal components analysis and a 
principal axis solution for which a Varimax rotation was used. The Scree Test was used to
help in determining the number of factors present for the data. The procedure Factor of 
SPSS, Release 4.1, for the VAX/VMS was used for the factor analyses. Problems associated 
with this means of testing for a dominant factor have been studied and different procedures for 
testing have been compared (see Nandukumar, 1994). Procedures like those discussed there 
were not used for this study. 

The second step involved using the BICAL3 program to obtain the ability estimates for the 
examinees and the difficulty and discrimination coefficients for the test items. After the first 
run, items which did not fit the model as determined by a t test and examinees whose scores 
did not fit the model, again determined by a t test, were removed from a subsequent run used 
to calibrate the test items. The estimate of the ability of each subject was noted. Those 
examinees whose ability levels were one standard deviation or more below the mean were 
placed in ability group one. Examinees who scored between one standard deviation below the 
mean and the mean were placed in ability group two. Examinees who scored between the 
mean and one standard deviation above the mean were placed in ability group three. 
Examinees who scored more than one standard deviation above the mean were placed in
ability group four.
The third step in this investigation was to remove a sample from the group to train the neural
net. The neural net software used was NeuralWorks Professional II/Plus running under
MS-DOS 6.2 for Windows (Neuralware, 1993). The response pattern for the items was used as
input, and the group placement determined from the BICAL3 program became the desired
output. After the neural net was trained, the weights were saved.

Step four involved using samples of examinees for which the response pattern and the group 
placement were known but which had not been used in training the net. The neural net
was applied to these samples using the weights calculated in step three. The success ratio, the
ratio of the number of times the neural net placed an examinee in the correct group to the
number of examinees in the group, was calculated.
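The success ratio described above amounts to a simple proportion. The following sketch uses hypothetical placements for five examinees; the function name and the data are illustrative only.

```python
def success_ratio(known_groups, net_groups):
    """Share of examinees the net placed in the same group as the known placement."""
    if len(known_groups) != len(net_groups):
        raise ValueError("both lists must cover the same examinees")
    successes = sum(known == placed
                    for known, placed in zip(known_groups, net_groups))
    return successes / len(known_groups)

# Hypothetical example: four of five placements agree.
ratio = success_ratio([1, 2, 3, 4, 2], [1, 2, 3, 3, 2])
```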



Results 

The results of the principal components analysis identified the potential for the existence of 
three or four factors. The Scree test along with a comparison of different principal axis 
solutions led to the interpretation of a three factor solution. The results of the principal 
components analysis are reported in Table 1. 



Insert Table 1 about here 



Although there were eight eigenvalues that were equal to or greater than 1.00, only three
appeared to represent interpretable factors. The principal axis analysis for a three-factor 
solution yielded the eigenvalue information reported in Table 2. 



Insert Table 2 about here 



In observing the factor structure resulting from the Varimax rotation, it was determined that 18
of the 40 items loaded primarily on Factor 1, with 14 of the 18 loadings being greater than .30;
twelve of the items loaded primarily on Factor 2, with 10 of the 12 loadings being greater than
.30; and 10 of the items loaded primarily on Factor 3, with three of the ten items having
loadings greater than .30. Five of the items belonging to Factor 2 also had loadings on Factor
1 that were greater than .25. Additionally, one other item belonging to Factor 2 loaded on 
Factor 3. 

On the basis of the factor analyses results, it was concluded that although the factor analyses 
did not produce a strong solution and there are probably two weak but interpretable secondary 
factors, the evidence does support (in a limited way) the existence of a dominant dimension
(see Hamilton et al., 1991). 

The BICAL3 procedure was initially calibrated on scores ranging between 40% and 90%
correct for the 40-item test. Nineteen examinees had scores above 36 and 297 had scores
below 16. Therefore, this calibration of the test was performed with 1,299 examinees. A 
decision was made to remove items for which the t test statistic had an absolute value greater 
than 2.00. Similarly, examinees were removed from the calibration if the absolute value of the 
corresponding t test statistic for testing the fit of an examinee to the model was greater than 
1.50. Items 6, 8, 19, 27, and 39 were removed as a result of the poor fit. Seventy-two 
examinees were removed as a result of not fitting the model. Five additional examinees were 
removed from the data set because they lacked identification numbers. A reanalysis resulted in 
the removal of six additional items: 20, 22, 23, 32, 33, and 34. Therefore, the final analysis 
was based on 29 of the 40 original items. The minimum and maximum score values for the 
range of inclusion were then readjusted to 12 and 28, respectively. 

A second analysis of the data was conducted using all of the examinees in the original pool 
with the exception of the five without identification numbers. The number of examinees that 
scored below 12 was 205. Because no examinee scored above 28, the calibration was 
performed using the test results for 1,405 examinees. 

The final analysis was made after eliminating examinees associated with t statistics larger in 
absolute value than 1.50. There were 103 examinees removed for this reason, leaving a total 
sample size of 1,507. After removing 205 examinees who scored below 12, 1,302 examinees 
remained for the final calibration. The mean ability expressed in logits for these examinees 
was .64 and the standard deviation was 1.05. 
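The sample sizes reported for the second and final analyses follow from the counts given in the text. The short check below only restates that arithmetic; the variable names are illustrative.

```python
# Counts reported in the text.
tested = 1615          # freshmen who took the 40-item test
missing_id = 5         # removed for lacking identification numbers
misfitting = 103       # examinees with |t| greater than 1.50
below_cutoff = 205     # raw scores below 12 on the reduced test

pool = tested - missing_id                         # examinees with identification numbers
second_calibration = pool - below_cutoff           # used in the second analysis
after_misfit = pool - misfitting                   # total sample after person-fit removal
final_calibration = after_misfit - below_cutoff    # used in the final calibration
```

These evaluate to 1,610, 1,405, 1,507, and 1,302 respectively, matching the figures in the preceding paragraphs.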

Presented in Table 3 are the items, item difficulties, item discrimination indices, and standard 
deviations. The mean difficulty index for the 29 items was 0.00 with a standard deviation of 
1.54. The mean discrimination index was 1.07 with a standard deviation of 0.19. 



Insert Table 3 about here 



The test characteristic curve is presented in Figure 4. The reader should note that the data
summarized in Figure 4 provide the ability estimate expressed in logits along with the
corresponding raw score.



Insert Figure 4 about here 
On the basis of the estimated performance level for each examinee, the examinees were then
classified as belonging to one of four performance level groups. Performance level groupings
were defined as: Group 1, those with performance estimates less than -.30; Group 2, those
with performance levels in the range of -.30 to .63, excluding .63; Group 3, those with scores
in the range of .63 to 1.9, excluding 1.9; and Group 4, those with scores equal to or greater
than 1.9. This is equivalent to subdividing the groups such that the 77 examinees in Group 1
were in the performance level range of scores less than one standard deviation below the mean;
the 417 examinees of Group 2 were in the performance range between one standard deviation
below the mean and the mean; the 632 examinees of Group 3 were in the performance level
range of the mean to one standard deviation above the mean; and the 176 examinees of Group
4 scored at or higher than one standard deviation above the mean.
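The group boundaries given above can be expressed as a small lookup. This is a sketch using the three cutoffs from the text (in logits); the function name is illustrative, and the use of `bisect_right` is simply one convenient way to make each lower bound inclusive and each upper bound exclusive.

```python
from bisect import bisect_right

# Cutoffs in logits, as stated in the text.
CUTOFFS = [-0.30, 0.63, 1.90]

def performance_group(theta):
    """Return 1-4: each range includes its lower cutoff and excludes its upper one."""
    return bisect_right(CUTOFFS, theta) + 1

groups = [performance_group(t) for t in (-1.2, -0.30, 0.62, 0.63, 1.89, 1.90)]

# Group sizes reported in the text should account for every calibrated examinee.
total_grouped = 77 + 417 + 632 + 176
```

An estimate exactly at a cutoff falls into the higher group, in line with the "excluding .63" and "equal to or greater than 1.9" wording.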

The BICAL3 program produces a file of examinees whose item responses were used in the
calibration process. This file contains the performance score, recorded in logits, for each
examinee. Because those examinees who had raw scores below 12 were excluded, the number
of examinees in Group 1 is smaller than for the other groups. This resultant file was then
matched with the original file of responses that also included the examinee identification code,
the examinee's item response pattern, and the performance score for the reduced item set. An
additional program written by the investigators produces yet another file that contains the
response pattern for each examinee for the reduced 29-item test and a vector of zeroes and
ones to indicate group membership.
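The file layout described above, a response pattern followed by a vector of zeroes and ones, can be sketched as follows. The investigators' program is not shown in the paper, so the function names and the record layout here are assumptions made for illustration.

```python
def group_indicator(group, n_groups=4):
    """Vector of zeroes with a single one marking the examinee's group (1-based)."""
    if not 1 <= group <= n_groups:
        raise ValueError("group must be between 1 and n_groups")
    return [1 if g == group else 0 for g in range(1, n_groups + 1)]

def training_record(responses, group):
    """Scored item responses (0/1) followed by the group-membership vector."""
    return list(responses) + group_indicator(group)

# Hypothetical examinee: 29 scored responses, placed in Group 3.
responses = [1, 0] * 14 + [1]
record = training_record(responses, 3)
```

Each record then supplies 29 input values and 4 desired output values for one examinee.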

The neural net consisted of an input layer with 29 nodes (one for each item), an output layer
with four nodes (one for each performance group), and one hidden layer. For this particular 
analysis, 10 nodes were selected to be in the one hidden layer. 

The neural net was trained with 400 examinees randomly selected from the 1,302 subjects 
used to calibrate the 29 item test. After the weights were established, two different 
independent samples of 200 were randomly chosen from the remaining 902 examinees. The 
results of the classification for each sample are shown in Table 4. 
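The sampling scheme just described can be sketched as below. The paper does not say whether the two samples of 200 were disjoint; this sketch draws each one separately from the remaining 902 examinees, so they may share members. That choice, the seed, and the function name are assumptions.

```python
import random

def split_examinees(ids, n_train=400, n_test=200, n_test_samples=2, seed=0):
    """Draw a training sample, then test samples from the examinees left over."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    remaining = shuffled[n_train:]
    # Each test sample is drawn without replacement from the remaining pool.
    tests = [rng.sample(remaining, n_test) for _ in range(n_test_samples)]
    return train, tests

# Examinee identifiers 1..1302 stand in for the calibration sample.
train, tests = split_examinees(range(1, 1303))
```

Keeping the test samples out of the training sample is what allows the success ratio to reflect performance on examinees the net has not seen.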



Insert Table 4 about here 



Conclusions 

The output from the NeuralWorks software is provided in Appendix A. The data for each
examinee appear on two lines. The first line is the group placement as obtained from
the BICAL3 procedure. The second line contains the output node values from the
NeuralWorks software. If the group indicator code (1) of line one appears directly above the
value closest to 1.0 in the second line, the classifications correspond and a success was
obtained. If this is not the case, the NeuralWorks placement of the examinee is considered a
failure.
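The rule for reading each pair of lines can be written out as a short check. The output values below are hypothetical and are not taken from Appendix A; the function names are illustrative.

```python
def predicted_group(outputs):
    """1-based position of the output node whose value is closest to 1.0."""
    best = min(range(len(outputs)), key=lambda k: abs(outputs[k] - 1.0))
    return best + 1

def is_success(indicator, outputs):
    """True when the 1 in the indicator line sits above the value closest to 1.0."""
    return indicator.index(1) + 1 == predicted_group(outputs)

# Hypothetical pairs of lines for two examinees.
first = is_success([0, 0, 1, 0], [0.02, 0.11, 0.91, 0.05])   # positions agree
second = is_success([1, 0, 0, 0], [0.30, 0.62, 0.04, 0.01])  # positions differ
```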



The neural net procedure did manage to classify 80% of the subjects into the performance
level groupings determined by the BICAL3 program. The most often missed classification was
in the lowest performance level group. It is quite possible the examinees in this group managed
to correctly answer some of the items with greater difficulty indices by guessing. This group
also had the fewest subjects because of the cut-off point for performing the test
calibration with BICAL3. Examinees having fewer than 12 items correct were not used in the
calibration process. Therefore, the neural net may not have been sufficiently trained to
recognize examinees who would fall into this group.

Another difficulty was that the standard error of measurement for the ability of the examinee 
may have been too high. This would suggest that further work on the test may be needed and 
that the 80% success ratio by the neural net is the best one could expect. 

It is quite possible that varying the parameters associated with the neural net may improve its
performance. Changing the number of nodes in the hidden layer or increasing the number of
hidden layers are possible variations from what the researchers have done that might improve
the success ratio. The function used does not have to be the sigmoid function. Any function
whose derivative can be expressed in terms of the original function and is bounded above and
below is a candidate. The hyperbolic tangent is one such function.
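As a brief check that the hyperbolic tangent meets both conditions named above, its derivative can be written in terms of the function itself, and its values are bounded:

```latex
\[
\frac{d}{dz}\tanh z = 1 - \tanh^{2} z, \qquad -1 < \tanh z < 1 .
\]
```

Because tanh ranges from -1 to 1 rather than from zero to one, the coding of the desired outputs might need to be adjusted if it replaced the sigmoid.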

This research does indicate that the neural net may be applied to testing. Although the neural 
net used in this research was trained to recognize an answer to an item as either correct or 
incorrect, it is possible to train a neural net to recognize that an item can be correct, incorrect 
or omitted. A subject could then be classified having been given a small subset of the available 
items. 

Another application would be to train the neural network on a small number of items with 
widely varying difficulty indices. The examinees could be classified into groups using the 
neural net. At this point, examinees in a particular group could be given items with difficulty 
levels suitable for the group to obtain estimates of the ability of each subject. 

It should be noted that the investigators did not write the test instrument used for this research.
Although classical difficulty and discrimination indices along with reliability information have
been used, this instrument was constructed by individuals having limited measurement
expertise. The investigators were given the data base in response to a request made on the
basis of believing this type of data would be appropriate for the procedure of interest. An
added difficulty was that the more recent IRT programs were not available. It may well be that
future investigations using more recent programs and a stronger test instrument may produce
more encouraging results.
Table 1

Summary Statistics for the Principal Components Analysis

Factor   Eigenvalue   Percentage of Variance   Cumulative Percentage
1        6.75         16.9                     16.9
2        1.55         3.9                      20.8
3        1.44         3.6                      24.4
4        1.13         2.8                      27.2
5        1.09         2.7                      29.9
6        1.06         2.7                      32.6
7        1.05         2.6                      35.2
8        1.00         2.5                      37.7


Table 2

Summary Statistics for the Principal Axis Analysis

Factor   Eigenvalue   Percentage of Variance   Cumulative Percentage of Variance
1        5.99         15.0                     15.0
2        0.75         1.9                      16.8
3        0.60         1.5                      18.3


Table 3

Item Parameter Estimates for the 29 Retained Items

Item   Difficulty    Standard Error   Discrimination
1      [illegible]   0.13             1.47
2      -1.08         0.08             1.13
3      -0.93         0.08             0.93
4      -0.2[?]       0.07             0.87
5      -0.23         0.07             0.76
7      -0.1[?]       0.07             1.05
9      -1.04         0.08             0.98
10     -0.27         0.07             1.22
11     -2.39         0.13             1.15
12     -2.72         0.15             0.63
13     -0.83         0.0[?]           1.44
14     -2.14         0.12             1.21
15     -1.31         0.09             1.19
16     0.03          0.06             1.35
17     -1.1[?]       0.09             1.18
18     -0.55         0.07             0.96
21     -0.72         0.08             0.85
24     0.93          0.06             1.14
25     1.62          0.06             1.16
26     1.14          0.06             1.09
28     1.96          0.07             0.92
29     3.[?]4        0.09             0.9[?]
30     0.28          0.06             1.12
31     1.12          0.06             0.91
35     2.19          0.07             1.07
36     2.18          0.07             1.04
38     1.83          0.06             1.09
40     1.71          0.06             1.08

Mean   0.00                           1.07
SD     1.54                           0.19

Table 4

Summary of the Classification Success Using a Neural Net

Sample   N of Successes   N of Failures   Success Ratio
1        160              40              .800
2        161              39              .805